Abstract

We study how to learn multiple dictionaries from a dataset and approximate any
data point by the sum of codewords, each chosen from the corresponding dictionary.
Although the global solution can theoretically achieve low approximation errors,
an effective practical solution has not been well studied. To solve the problem,
we propose a simple yet effective algorithm, Group K-Means. Specifically, we take
each dictionary, or any two selected dictionaries, as a group of K-means cluster
centers, and then handle the approximation by minimizing the approximation
error. In addition, we propose a hierarchical initialization for this non-convex
problem. Experimental results validate the effectiveness of the approach.


1 Introduction 

K-means is a well-known clustering algorithm that has been widely applied in numerous applications.
The algorithm aims to partition $N$ $P$-dimensional points into $K$ clusters in which each point
belongs to the cluster with the nearest mean. Let $\mathcal{X} = \{x_1, \cdots, x_N\} \subset \mathbb{R}^P$ be the dataset and
$\mathcal{D} = \{d_1, \cdots, d_K\} \subset \mathbb{R}^P$ be the cluster centers. The clusters $\mathcal{D}$ are learned by minimizing

$$\sum_{x \in \mathcal{X}} \min_{k} \|x - d_k\|_2^2, \tag{1}$$

where $\|\cdot\|_p$ denotes the $\ell_p$ norm. The objective function can be iteratively minimized [7]. Each
iteration involves an assignment step and an update step. In the former step, the nearest center of
each point is calculated, while the latter computes the mean of the points assigned to the same
cluster. Each point can be represented by the index of the nearest cluster center, which requires
$\lceil \log_2 K \rceil$ bits^1.
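The two steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the function names `kmeans` and `objective`, the random seeding, and the fixed iteration count are all hypothetical choices.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def nearest(x, centers):
    """Index of the center closest to x."""
    return min(range(len(centers)), key=lambda k: sq_dist(x, centers[k]))

def kmeans(points, K, iters=20, seed=0):
    """Alternate the assignment step and the update step for a fixed number of iterations."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, K)]  # assumed: random initial centers
    for _ in range(iters):
        assign = [nearest(x, centers) for x in points]  # assignment step
        for k in range(K):                              # update step
            members = [x for x, a in zip(points, assign) if a == k]
            if members:  # keep the old center if a cluster becomes empty
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return centers, [nearest(x, centers) for x in points]

def objective(points, centers):
    """Value of Eqn. (1): sum over points of the squared distance to the nearest center."""
    return sum(min(sq_dist(x, d) for d in centers) for x in points)
```

Each call to `nearest` costs one distance evaluation per center, which is where the per-point cost that grows with both the number of centers and the dimension comes from.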

The assignment of each point requires $O(KP)$ time. When the number of clusters is huge, exact
K-means is prohibitive due to this high time cost. To address the scalability issue,
we can split the $P$-dimensional vector into $M$ subvectors as in [4, 5, 8]. The standard K-means
algorithm is then applied on each subvector, resulting in $K^M$ cluster centers but with only $O(KP)$
assignment time cost. The number of bits required to represent each point is $\log_2 K^M = M \log_2 K$.

Recently, multiple dictionaries are proposed in [1, 3, 10]. Each dictionary contributes one codeword,
and the sum of these codewords is used to approximate one data point. Let
$\mathcal{D}^c = \{d^c_1, \cdots, d^c_K\} \subset \mathbb{R}^P$ be the $c$-th dictionary with $c \in \{1, \cdots, C\}$. The objective is to minimize

$$\sum_{x \in \mathcal{X}} \min_{k_1, \cdots, k_C} \Big\| x - \sum_{c=1}^{C} d^c_{k_c} \Big\|_2^2 \tag{2}$$

to learn the dictionaries $\{\mathcal{D}^c\}_{c=1}^{C}$. In [9], the problem is studied on the subvector, and Eqn. (2) can be seen as
the special case where the subvector is equal to the full vector, i.e. the number of subvectors is 1. This
problem is also explored in [10], but with an additional constraint to make the scheme more suitable
for Euclidean approximate nearest neighbor search. To represent each point, we need $C \log_2 K$ bits
to indicate which codewords are selected from all the dictionaries.

^1 In the following, without confusion we omit the $\lceil z \rceil$ operator, which represents the smallest integer that is
not smaller than $z$.

Algorithm 1 Iterative optimization in gk-means

Input: Dataset $\mathcal{X} = \{x_1, \cdots, x_N\}$, number of dictionaries $C$
Output: Multiple dictionaries $\{\mathcal{D}^c, c \in \{1, \cdots, C\}\}$

1: Initialize multiple dictionaries $\{\mathcal{D}^c, c \in \{1, \cdots, C\}\}$ by Sec. 2.3.1
2: Initialize the assignments $\{k_1, \cdots, k_C\}$ for each point $x \in \mathcal{X}$ by Sec. 2.3.2
3: while not converged do
4:     Update multiple dictionaries by Sec. 2.2
5:     Compute assignments by Sec. 2.1
6: end while
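The code lengths quoted so far follow directly from counting index bits. The short sketch below makes the arithmetic explicit; the function names are hypothetical, and the example values ($K = 256$ with 4, 8, or 16 dictionaries) are the ones used later in the experiments.

```python
import math

def bits_per_index(K):
    """Bits needed to store one index into K codewords: ceil(log2 K)."""
    return math.ceil(math.log2(K))

def code_length_partitioned(M, K):
    """M subvectors, one K-means codebook each: M * log2(K) bits, K**M effective centers."""
    return M * bits_per_index(K)

def code_length_multi_dictionary(C, K):
    """C dictionaries on the full vector, one codeword each: C * log2(K) bits."""
    return C * bits_per_index(K)

# With K = 256, every index fits in exactly one byte.
lengths = [code_length_multi_dictionary(C, 256) for C in (4, 8, 16)]
```

Setting the number of subvectors equal to the number of dictionaries gives identical code lengths, which is what allows the two families of methods to be compared fairly.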


It is easily verified based on [9] that a globally optimal solution can give a lower distortion error
than [5, 8] under the same code length. However, it is very challenging to obtain the global solution
due to the non-convexity of the problem. In [9], an intuitive recursive algorithm is proposed, but its
complexity is exponential in the number of dictionaries, so it scales poorly to a larger $C$.

In this paper, we propose a simple yet effective algorithm, named Group K-Means (gk-means for
short)^2, to solve the problem. Specifically, we take each dictionary or any two consecutive
dictionaries as a group of K-means clusters, and solve the assignment with a time complexity that is
linear in the number of dictionaries. Due to the non-convexity of the problem, we propose a hierarchical
scheme composed of multiple stages to initialize the dictionaries. Each stage solves a subproblem
where a portion of the entries in the dictionaries are enforced to be 0. Experimental results
verify the effectiveness of our approach.


2 Group K-means

To minimize Eqn. (2), we iteratively perform the assignment step and the update step, as shown
in Alg. 1 and introduced in the first two subsections. The former computes the assignments
$(k_1, \cdots, k_C)$ of each point $x$ based on the dictionaries and the assignments in the previous
iteration. The latter updates the multiple dictionaries based on the current assignments. The
initialization of the dictionaries and the assignments is introduced in the last subsection.

2.1 Assignment Step 

2.1.1 Order-1 Group Assignment 

In Eqn. (2), each point is approximated by the sum of multiple codewords, and each codeword
is chosen from a different dictionary. We first take each dictionary as a group of clusters on the
residual. For the $c_1$-th dictionary, the residual is defined as

$$y = x - \sum_{c \neq c_1} d^c_{k_c}. \tag{3}$$

The assignments $\{k_c, c \neq c_1\}$ are from the previous iteration. We can also assert that the
quantization error between $y$ and any codeword $d^{c_1}_{k_{c_1}} \in \mathcal{D}^{c_1}$ equals the distortion error between $x$ and the
combination of the codewords from all the dictionaries, i.e. $\|y - d^{c_1}_{k_{c_1}}\|_2^2 = \|x - \sum_{c} d^c_{k_c}\|_2^2$. Thus,
the index $k_{c_1}$ can naturally be chosen by

$$k_{c_1} = \arg\min_{k_{c_1}} \frac{1}{2} \big\| y - d^{c_1}_{k_{c_1}} \big\|_2^2. \tag{4}$$

^2 The algorithm is similar to prior work, but we conducted the research independently and apply it to
data/feature compression and image retrieval.

Algorithm 2 Assignment step in the current iteration

Input: $\mathcal{X} = \{x_1, \cdots, x_N\}$, the dictionaries $\{\mathcal{D}^c\}_{c=1}^{C}$, assignments $\{k_c\}_{c=1}^{C}$ for each $x \in \mathcal{X}$ in the previous iteration
Output: $\{k_c\}_{c=1}^{C}$ for each $x \in \mathcal{X}$ in the current iteration

1: Compute $\{T^{c_1,c_2}_{k_{c_1},k_{c_2}}\}$ according to Eqn. (8)
2: for $x \in \mathcal{X}$ do
3:     Compute $\{S^{c_1}_{k_{c_1}}\}$, $c_1 = 1, \cdots, C$, $k_{c_1} = 1, \cdots, K$, according to Eqn. (7)
4:     while not converged do
5:         Compute $k_{c_1}$, $c_1 \in \{1, \cdots, C\}$ by Sec. 2.1.1
           or
           Compute $(k_{c_1}, k_{c_2})$, $c_2 = (c_1 + 1) \% C$, $c_1 \in \{1, \cdots, C\}$ by Sec. 2.1.2
6:     end while
7: end for


Since the assignment $k_{c_1}$ depends on the assignments $k_c$ with $c \neq c_1$, we propose to iteratively
compute the assignments over all groups until the assignments do not change. If the number of
iterations needed to scan all the groups for one point is $S$, this requires $O(SKCP)$ multiplications. To
reduce the computation cost, we substitute Eqn. (3) into Eqn. (4), and have



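$$\frac{1}{2}\big\|y - d^{c_1}_{k_{c_1}}\big\|_2^2 = -x^\top d^{c_1}_{k_{c_1}} + \sum_{c \neq c_1} \big(d^{c_1}_{k_{c_1}}\big)^\top d^{c}_{k_c} + \frac{1}{2}\big\|d^{c_1}_{k_{c_1}}\big\|_2^2 + \text{const} \tag{5}$$

$$= S^{c_1}_{k_{c_1}} + \sum_{c=1}^{C} T^{c_1,c}_{k_{c_1},k_c} + \text{const}, \tag{6}$$

where const is a constant to $k_{c_1}$ and

$$S^{c_1}_{k_{c_1}} = -x^\top d^{c_1}_{k_{c_1}}, \tag{7}$$

$$T^{c_1,c_2}_{k_{c_1},k_{c_2}} = \begin{cases} \big(d^{c_1}_{k_{c_1}}\big)^\top d^{c_2}_{k_{c_2}}, & c_1 \neq c_2 \\ \frac{1}{2}\big\|d^{c_1}_{k_{c_1}}\big\|_2^2, & c_1 = c_2,\ k_{c_1} = k_{c_2} \\ 0, & \text{otherwise.} \end{cases} \tag{8}$$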


The second term in Eqn. (6) is independent of the point $x$, and thus we can pre-compute a lookup
table $\{T^{c_1,c_2}_{k_{c_1},k_{c_2}}\}$ before scanning all the points. For each point, we compute $\{S^{c_1}_{k_{c_1}}\}$ before scanning
all the group centers. Then, the computation of Eqn. (4) only requires $O(C)$ additions by Eqn. (6), and
the complexity of multiplication is reduced to $O(KCP)$ from $O(SKCP)$. Since each dictionary
is referred to as one group of clusters, we call this scheme Order-1 Group Assignment (O1GA for
short).
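The lookup-table scheme can be sketched as follows. This is an illustrative sketch rather than the authors' code: `build_T` and `o1ga` are hypothetical names, dictionaries are plain nested lists, and the sketch assumes that the unary term is $-x^\top d$ and that the diagonal entries of the table carry the half squared norm of a codeword, as in Eqn. (7) and Eqn. (8).

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def build_T(dicts):
    """Point-independent table T[c1][c2][k1][k2], following Eqn. (8)."""
    C, K = len(dicts), len(dicts[0])
    T = [[[[0.0] * K for _ in range(K)] for _ in range(C)] for _ in range(C)]
    for c1 in range(C):
        for c2 in range(C):
            for k1 in range(K):
                for k2 in range(K):
                    if c1 != c2:
                        T[c1][c2][k1][k2] = dot(dicts[c1][k1], dicts[c2][k2])
                    elif k1 == k2:
                        T[c1][c2][k1][k2] = 0.5 * dot(dicts[c1][k1], dicts[c1][k1])
    return T

def o1ga(x, dicts, T, codes, max_sweeps=10):
    """Re-assign one dictionary at a time until no index changes (or max_sweeps is hit)."""
    C, K = len(dicts), len(dicts[0])
    S = [[-dot(x, d) for d in D] for D in dicts]  # per-point table, Eqn. (7)
    codes = list(codes)
    for _ in range(max_sweeps):
        changed = False
        for c1 in range(C):
            def cost(k):
                # Eqn. (6) without the constant: only table lookups and additions.
                cross = sum(T[c1][c][k][codes[c]] for c in range(C) if c != c1)
                return S[c1][k] + cross + T[c1][c1][k][k]
            best = min(range(K), key=cost)
            if best != codes[c1]:
                codes[c1], changed = best, True
        if not changed:
            break
    return codes
```

In this sketch all multiplications happen while filling `T` (once per set of dictionaries) and `S` (once per point); the inner loop over sweeps only adds table entries.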


2.1.2 Order-2 Group Assignment 

Furthermore, we propose the scheme Order-2 Group Assignment (O2GA for short), where any two
dictionaries can be taken as a group of clusters.

For the $c_1$-th and $c_2$-th dictionaries, we obtain the residual similarly to Eqn. (3) by

$$y = x - \sum_{c \neq c_1,\, c \neq c_2} d^c_{k_c}. \tag{9}$$

Each pair $(k_{c_1}, k_{c_2})$ can construct a cluster center $d^{c_1}_{k_{c_1}} + d^{c_2}_{k_{c_2}}$. Thus, we compute the assignment
as

$$(k_{c_1}, k_{c_2}) = \arg\min_{k_{c_1}, k_{c_2}} \frac{1}{2}\big\|y - d^{c_1}_{k_{c_1}} - d^{c_2}_{k_{c_2}}\big\|_2^2 \tag{10}$$

$$= \arg\min_{k_{c_1}, k_{c_2}} \Big(S^{c_1}_{k_{c_1}} + \sum_{c \neq c_2} T^{c_1,c}_{k_{c_1},k_c}\Big) + \Big(S^{c_2}_{k_{c_2}} + \sum_{c \neq c_1} T^{c_2,c}_{k_{c_2},k_c}\Big) + T^{c_1,c_2}_{k_{c_1},k_{c_2}}, \tag{11}$$


where $S^{c_1}_{k_{c_1}}$ and $T^{c_1,c_2}_{k_{c_1},k_{c_2}}$ are defined in Eqn. (7) and Eqn. (8), respectively. To compute Eqn. (11), we
need only $O(K^2)$ additions rather than multiplications. The formulation is general for any $c_1 \neq c_2$.
To reduce the time cost, we only apply it on two consecutive dictionaries.
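A joint update of one pair of dictionaries can be sketched as below. The sketch is hedged in the same way as any illustration here: `t` and `o2ga_pair` are hypothetical names, the pairwise table entries are computed on the fly for brevity instead of being read from a pre-computed table, and the cost is assumed to be the sum of two unary parts plus one pairwise entry as in Eqn. (11).

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def t(dicts, c1, k1, c2, k2):
    """Entry T^{c1,c2}_{k1,k2} of Eqn. (8), computed on the fly in this sketch."""
    if c1 != c2:
        return dot(dicts[c1][k1], dicts[c2][k2])
    return 0.5 * dot(dicts[c1][k1], dicts[c1][k1]) if k1 == k2 else 0.0

def o2ga_pair(x, dicts, codes, c1, c2):
    """Jointly choose (k_c1, k_c2) with all other assignments fixed (requires c1 != c2)."""
    C, K = len(dicts), len(dicts[0])

    def unary(c, k, skip):
        # S^c_k plus the sum of T^{c,c'} over c' != skip; the term c' == c is the half norm.
        total = -dot(x, dicts[c][k])
        for cp in range(C):
            if cp != skip:
                total += t(dicts, c, k, cp, k if cp == c else codes[cp])
        return total

    u1 = [unary(c1, k, c2) for k in range(K)]
    u2 = [unary(c2, k, c1) for k in range(K)]
    pairs = [(a, b) for a in range(K) for b in range(K)]  # K**2 candidate centers
    return min(pairs, key=lambda p: u1[p[0]] + u2[p[1]] + t(dicts, c1, p[0], c2, p[1]))
```

As a small usage example, with the one-dimensional dictionaries `[[2.0], [-1.0]]` and `[[-2.0], [1.5]]`, the point `[0.0]`, and current codes `[1, 1]`, changing either index alone increases the error, while the joint search returns `(0, 0)`, whose codewords sum exactly to the point.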

Discussion. The assignment algorithm is illustrated in Alg. 2 for O1GA and O2GA. Note that
the iteration in Line 4 is important because the assignments $\{k_c, c \neq c_1\}$ or $\{k_c, c \neq c_1, c \neq c_2\}$
may change after all the dictionaries are scanned once, and the assignment $k_{c_1}$ or $(k_{c_1}, k_{c_2})$ can be
re-computed to further reduce the distortion error.

The scheme O2GA can be straightforwardly extended to O$_n$GA. If $n$ equals $C$, the globally optimal
assignment can be obtained. However, the time cost of all the additions in Eqn. (11) is then $O(K^n)$, which is
exponential in $n$. Thus, we set $n = 1, 2$ experimentally. With $n$ fixed, the complexity is linear in
the number of dictionaries $C$, and is much lower than the exponential complexity of [9]. Intuitively,
the distortion induced by O2GA should be lower than or equal to that induced by O1GA. However, this
cannot be guaranteed. One reason is that in the iterative optimization, the dictionaries are
also optimized and become quite different even after one update step. Given different dictionaries,
we cannot assert the superiority of O2GA over O1GA. In most cases, however, the superiority is
demonstrated experimentally.


2.2 Update Step 


The objective function in Eqn. (2) is quadratic w.r.t. the multiple dictionaries. Thus, the dictionaries
can be updated by setting the derivative w.r.t. the dictionaries to 0. In the following, we derive the
equivalent result by taking each center as the mean of its cluster, which closely resembles traditional K-means.

Given the assignment $\{k_c\}$ for the $i$-th point, we first introduce an indicator function

$$r^{c}_{i,k} = \begin{cases} 1, & k_c = k \text{ for } x_i \\ 0, & \text{otherwise.} \end{cases} \tag{12}$$

It represents whether the assignment of the $i$-th point on the $c$-th dictionary is $k$. Then, the residual
in Eqn. (3) can be written for the $i$-th point as

$$y_i = x_i - \sum_{c \neq c_1,\, k} r^{c}_{i,k}\, d^{c}_{k}. \tag{13}$$

Next, we update the center by the mean of the residuals within the cluster, i.e.

$$d^{c_1}_{k_1} = \frac{\sum_i r^{c_1}_{i,k_1}\, y_i}{\sum_i r^{c_1}_{i,k_1}}. \tag{14}$$

Substituting Eqn. (13) into Eqn. (14) and simplifying the equation, we have

$$\sum_i r^{c_1}_{i,k_1}\, x_i = \sum_{k,c} \Big( \sum_i r^{c_1}_{i,k_1}\, r^{c}_{i,k} \Big) d^{c}_{k}. \tag{15}$$


Since Eqn. (15) holds for any $c_1 \in \{1, \cdots, C\}$ and $k_1 \in \{1, \cdots, K\}$, the matrix form is $W = DZ$,
where

$$W = [w^1_1 \ \cdots \ w^1_K \ \cdots \ w^C_1 \ \cdots \ w^C_K], \tag{16}$$

$$w^{c_1}_{k_1} = \sum_i r^{c_1}_{i,k_1}\, x_i, \quad k_1 \in \{1, \cdots, K\},\ c_1 \in \{1, \cdots, C\}, \tag{17}$$

$$D = [d^1_1 \ \cdots \ d^1_K \ \cdots \ d^C_1 \ \cdots \ d^C_K]. \tag{18}$$

The element of $Z \in \mathbb{R}^{KC \times KC}$ in the $((c_1 - 1)K + k_1)$-th row and the $((c_2 - 1)K + k_2)$-th column
is

$$Z^{c_1,c_2}_{k_1,k_2} = \sum_i r^{c_1}_{i,k_1}\, r^{c_2}_{i,k_2}, \quad k_1, k_2 \in \{1, \cdots, K\},\ c_1, c_2 \in \{1, \cdots, C\}, \tag{19}$$

which can be interpreted as the number of points whose assignments on the $c_1$-th dictionary and on
the $c_2$-th dictionary are $k_1$ and $k_2$, respectively. Then, we can solve for the dictionaries $D$ by matrix
(pseudo-)inversion.
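To make the matrices concrete, the sketch below accumulates $W$ and $Z$ from the data and the current assignments. It is only an illustration: the function name `build_W_Z` is hypothetical, indices are 0-based (so the pair $(c, k)$ maps to column $cK + k$), and the final linear solve is left out.

```python
def build_W_Z(points, codes, C, K):
    """Accumulate W (P x KC) and Z (KC x KC) of W = D Z from points and their codes."""
    P = len(points[0])
    W = [[0.0] * (K * C) for _ in range(P)]
    Z = [[0] * (K * C) for _ in range(K * C)]
    for x, code in zip(points, codes):
        cols = [c * K + k for c, k in enumerate(code)]  # columns selected by this point
        for j in cols:
            for p in range(P):
                W[p][j] += x[p]      # Eqn. (17): sum of the points assigned to (c, k)
            for j2 in cols:
                Z[j][j2] += 1        # Eqn. (19): co-occurrence count of two assignments
    return W, Z
```

Given `W` and `Z`, the dictionaries can be recovered with any pseudo-inverse routine, for example by multiplying `W` with the result of `numpy.linalg.pinv` applied to `Z` if NumPy is available. The diagonal blocks of `Z` are diagonal matrices holding cluster sizes, which is why the update reduces to the usual cluster mean when there is a single dictionary.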


2.3 Initialization 

Since the problem of minimizing Eqn. (2) is non-convex, different initializations fall into different
local minima, and thus the initialization is quite important.


2.3.1 Initialization of Dictionaries 


One direct method is to randomly sample points from the dataset to construct the dictionaries, as
in [9]. Empirically, we find this scheme works well for cases with a small number of dictionaries
(e.g. $C = 2$ as in [9]), but the performance degrades for a larger $C$. In this work, we propose two
initialization schemes. The first is based on traditional K-means, and the second is a hierarchical
scheme based on [8, 9].

K-Means-based initialization runs the traditional K-means algorithm on the residual repeatedly.
Specifically, we first set the residual to the original dataset, i.e. $y = x, \forall x \in \mathcal{X}$, and the index
of the to-be-initialized dictionary to $c = 1$. Then, K-means is performed on $\{y\}$ to obtain the
$K$ cluster centers as the first dictionary $\{d^1_1, \cdots, d^1_K\}$. After this, we update the residual of each
point by $y \leftarrow y - d^c_{k_c}$, where $k_c$ is the assignment of $y$, and set $c \leftarrow c + 1$. The K-means algorithm is
then run again to obtain the second dictionary. By alternately updating the residual and performing
K-means on the residual, we obtain all the dictionaries as an initialization.
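The residual-based procedure can be sketched as follows. This is a minimal sketch under stated assumptions: `kmeans_based_init` is a hypothetical name, the inner K-means uses random initial centers and a fixed iteration count, and empty clusters simply keep their previous center.

```python
import random

def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def nearest(y, centers):
    return min(range(len(centers)), key=lambda k: sq_dist(y, centers[k]))

def kmeans(points, K, iters, rng):
    centers = [list(p) for p in rng.sample(points, K)]
    for _ in range(iters):
        assign = [nearest(y, centers) for y in points]
        for k in range(K):
            members = [y for y, a in zip(points, assign) if a == k]
            if members:  # keep the old center if a cluster becomes empty
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return centers, [nearest(y, centers) for y in points]

def kmeans_based_init(points, C, K, iters=30, seed=0):
    """Run K-means on the residual C times; each run yields one dictionary."""
    rng = random.Random(seed)
    residual = [list(x) for x in points]          # y = x for every point
    dicts, codes = [], [[] for _ in points]
    for _ in range(C):
        centers, assign = kmeans(residual, K, iters, rng)
        dicts.append(centers)
        for i, k in enumerate(assign):
            codes[i].append(k)
            # y <- y - d^c_{k_c}
            residual[i] = [r - d for r, d in zip(residual[i], centers[k])]
    return dicts, codes, residual
```

The function also returns the per-point codes and the final residuals, so the squared norm of the residuals directly gives the distortion of the initialization.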

Hierarchical Initialization can ensure that in theory the approach performs no worse than the
vector partitioning approaches (e.g. [8]) under the same code length. The basic idea is to solve
$\log_2 C$ subproblems^3 where different constraints are applied on the multiple dictionaries. Let $D^c =
[d^c_1 \ \cdots \ d^c_K]$ and $b^c \in \{0, 1\}^K$, $\|b^c\|_1 = 1$. The index of 1 in $b^c$ represents which codeword is
selected. We can rewrite the objective of Eqn. (2) as

$$\sum_{x \in \mathcal{X}} \min_{b^1, \cdots, b^C} \left\| x - [D^1 \ \cdots \ D^C] \begin{bmatrix} b^1 \\ \vdots \\ b^C \end{bmatrix} \right\|_2^2. \tag{20}$$

Before introducing the approach in a general case, we take $C = 4$ as an example. First, we minimize
Eqn. (20) with the dictionaries $D = [D^1 \ \cdots \ D^4]$ constrained to be

$$D = R_1 \begin{bmatrix} D^{1,1}_1 & 0 & 0 & 0 \\ 0 & D^{2,1}_1 & 0 & 0 \\ 0 & 0 & D^{3,1}_1 & 0 \\ 0 & 0 & 0 & D^{4,1}_1 \end{bmatrix}, \quad R_1^\top R_1 = I, \tag{21}$$


where $R_1$ is a rotation matrix and $D^{m,1}_1$, $m \in \{1, 2, 3, 4\}$, is a real matrix of size $P/4 \times K$. The
notation $D^{u,v}_s$ denotes the block in the $u$-th row and $((u - 1)2^{s-1} + v)$-th column of $D$ in the $s$-th
subproblem. This subproblem is studied in [8], and can be solved by alternating optimizations with
regard to $R_1$, $\{D^{m,1}_1\}$, and $\{b^c\}$. We initialize $R_1$ by an identity matrix and $\{D^{m,1}_1\}$ by randomly
choosing the data points on the corresponding subvector. The optimal solution is used to initialize
the second subproblem, where the objective function remains the same but the constraint is relaxed
to be

$$D = R_2 \begin{bmatrix} D^{1,1}_2 & D^{1,2}_2 & 0 & 0 \\ 0 & 0 & D^{2,1}_2 & D^{2,2}_2 \end{bmatrix}, \quad R_2^\top R_2 = I. \tag{22}$$

^3 We assume $C$ is a power of 2. Meanwhile, the dimension $P$ is assumed to be divisible by $C$ in the
following. This algorithm can easily adapt to general cases.
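The initialization of the second subproblem is as follows:

$$R_2 = R_1^*, \quad D^{1,1}_2 = \begin{bmatrix} {D^{1,1}_1}^* \\ 0 \end{bmatrix}, \quad D^{1,2}_2 = \begin{bmatrix} 0 \\ {D^{2,1}_1}^* \end{bmatrix}, \quad D^{2,1}_2 = \begin{bmatrix} {D^{3,1}_1}^* \\ 0 \end{bmatrix}, \quad D^{2,2}_2 = \begin{bmatrix} 0 \\ {D^{4,1}_1}^* \end{bmatrix}, \tag{23}$$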

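where the asterisk $*$ denotes the optimal solution. This subproblem is studied in [9], and it is verified
that the distortion can be lower than that of ck-means [8]. We solve the subproblem in a similar manner
to [9], except that the assignment step is replaced by that in Sec. 2.1. Finally, $D$ is initialized by

$$D^1 = R_2^* \begin{bmatrix} {D^{1,1}_2}^* \\ 0 \end{bmatrix}, \quad D^2 = R_2^* \begin{bmatrix} {D^{1,2}_2}^* \\ 0 \end{bmatrix}, \quad D^3 = R_2^* \begin{bmatrix} 0 \\ {D^{2,1}_2}^* \end{bmatrix}, \quad D^4 = R_2^* \begin{bmatrix} 0 \\ {D^{2,2}_2}^* \end{bmatrix}. \tag{24}$$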

In summary, each subproblem enforces a different level of restriction on the multiple dictionaries.
In the first subproblem, the restriction is most severe, and most of the elements are constrained
to be 0. It is then relaxed gradually in the subsequent subproblems. Generally, the $s$-th ($s \in
\{1, \cdots, \log_2 C\}$) subproblem is constrained by $R_s^\top R_s = I$ and

$$D = R_s \begin{bmatrix} D^{1,1}_s \ \cdots \ D^{1,2^{s-1}}_s & & \\ & \ddots & \\ & & D^{C/2^{s-1},1}_s \ \cdots \ D^{C/2^{s-1},2^{s-1}}_s \end{bmatrix}. \tag{25}$$

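The initialization of the $(s + 1)$-th ($s \leq \log_2 C - 1$) subproblem from the optimal solution of the
$s$-th subproblem is $R_{s+1} = R_s^*$ and

$$D^{u,v}_{s+1} = \begin{cases} \begin{bmatrix} {D^{2u-1,\,v}_s}^* \\ 0 \end{bmatrix}, & \text{if } u \in \{1, \cdots, C/2^s\},\ v \in \{1, \cdots, 2^{s-1}\} \\[2ex] \begin{bmatrix} 0 \\ {D^{2u,\,v-2^{s-1}}_s}^* \end{bmatrix}, & \text{if } u \in \{1, \cdots, C/2^s\},\ v \in \{2^{s-1} + 1, \cdots, 2^s\}. \end{cases} \tag{26}$$

The final initialization for Eqn. (2) is

$$D^u = R^*_{\log_2 C} \begin{bmatrix} {D^{1,u}_{\log_2 C}}^* \\ 0 \end{bmatrix},\ 1 \leq u \leq C/2; \qquad D^u = R^*_{\log_2 C} \begin{bmatrix} 0 \\ {D^{2,\,u-C/2}_{\log_2 C}}^* \end{bmatrix},\ C/2 + 1 \leq u \leq C. \tag{27}$$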

2.3.2 Initialization of assignments 

In Sec. 2.1, the assignment is based on the residual defined in Eqn. (3) and Eqn. (9) for O1GA
and O2GA, respectively. To initialize the assignment, we only use the assignments that have already been
initialized to compute the residual. Taking O1GA as an example, we compute the residual by
$y = x - \sum_{c < c_1} d^c_{k_c}$ for the $c_1$-th
dictionary. After iterating $c_1$ from 1 to $C$, we refine the assignment by Alg. 2. This is also applied
to encode new data points after obtaining the dictionaries. That is, we initialize the assignments first
and then refine them by Alg. 2. Similar ideas can be applied to initialize the assignments for O2GA.
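A greedy pass of this kind can be sketched as follows. The sketch assumes that the residual for a given dictionary subtracts only the codewords already chosen for the preceding dictionaries; the function name `init_codes` is hypothetical, and the subsequent refinement by the iterative assignment step is not shown.

```python
def init_codes(x, dicts):
    """Pick one codeword per dictionary in order, each time quantizing the current residual."""
    residual = list(x)
    codes = []
    for D in dicts:
        # Nearest codeword of this dictionary to the current residual.
        k = min(range(len(D)),
                key=lambda j: sum((r - d) ** 2 for r, d in zip(residual, D[j])))
        codes.append(k)
        # Remove the chosen codeword before moving on to the next dictionary.
        residual = [r - d for r, d in zip(residual, D[k])]
    return codes

# Example with two small dictionaries in two dimensions.
example = init_codes([5, 4], [[[0, 0], [4, 4]], [[0, 0], [1, 0]]])
```

The resulting codes serve only as a starting point; later dictionaries are ignored when earlier ones are assigned, which is why a refinement pass is still useful.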


3 Experiments 

3.1 Settings 

We conduct the experiments on three widely used high-dimensional datasets: SIFT1M [5],
GIST1M [5], and MNIST [6]. SIFT1M has $10^5$ training features, $10^4$ query features, and $10^6$
database features. Each feature is a 128-dimensional SIFT descriptor. GIST1M has $5 \times 10^5$
training features, $10^3$ query features, and $10^6$ database features, each being a 960-dimensional GIST
feature. MNIST contains 60,000 database images (also used as the training set) and 10,000 query
images. Each image has $28 \times 28$ pixels, and we vectorize it as a 784-dimensional feature vector.

The accuracy is measured by the relative distortion, which is defined as the objective function of
Eqn. (2) with the optimized solutions divided by $\sum_{x \in \mathcal{X}} \|x\|_2^2$. This indicator is important in data
compression, approximate nearest neighbor (ANN) search, etc. A lower distortion means better
accuracy. Due to the space limitation, we report the experiments on the application of ANN
search in the supplementary material.
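For concreteness, the measure can be computed as in the sketch below. The normalizer is assumed here to be the total squared norm of the data, since the denominator is not fully legible in the source; the function name `relative_distortion` and the nested-list data layout are likewise illustrative choices.

```python
def relative_distortion(points, dicts, codes):
    """Total squared approximation error divided by the total squared norm of the data."""
    num = den = 0.0
    for x, code in zip(points, codes):
        # Approximation of x: the sum of the selected codewords, one per dictionary.
        approx = [sum(dicts[c][k][p] for c, k in enumerate(code)) for p in range(len(x))]
        num += sum((xi - ai) ** 2 for xi, ai in zip(x, approx))
        den += sum(xi * xi for xi in x)   # assumed normalizer
    return num / den

# One point (3, 4) approximated by (3, 0) + (0, 3): error 1, squared norm 25.
value = relative_distortion([[3, 4]], [[[3, 0]], [[0, 3]]], [(0, 0)])
```

A value of 0 means the data are reproduced exactly, while a value of 1 corresponds to approximating every point by the zero vector.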
Figure 1: Relative distortion on the training set with different numbers of iterations. The first row
corresponds to SIFT1M; the second to GIST1M; and the third to MNIST.


We compare our approach with ck-means [8] and optimized Cartesian k-means (ock-means) [9]
under the same code length. Let $M_{ck}$ and $M_{ock}$ be the numbers of subvectors in ck-means and
ock-means respectively, and let $C_{ock}$ and $C_{gk}$ be the number of dictionaries on each subvector of
ock-means and the number of dictionaries of gk-means respectively. Then the code lengths of ck-means,
ock-means, and gk-means are $M_{ck} \log_2 K$, $M_{ock} C_{ock} \log_2 K$, and $C_{gk} \log_2 K$, respectively. We
set $M_{ck} = M_{ock} C_{ock} = C_{gk}$ to make the code length identical for all the approaches. Following
prior work, we set $K = 256$ so that each index fits in one byte, and $C_{gk} = 4, 8, 16$ to obtain the code lengths
32, 64, 128, respectively. As for the initialization of gk-means described in Sec. 2.3.1, 30 iterations
are used both in each K-means run for the K-means-based initialization and in each subproblem
for the hierarchical initialization. All the approaches run for at most 100 iterations and
stop earlier if they converge. The performance is expected to improve with a larger
number of iterations; here we fix the maximum number of iterations for comparison purposes.
To minimize Eqn. (2), a multiple candidate matching pursuit (MCMP) algorithm is proposed on
each subvector in [9]. Since MCMP is exponential in the number of dictionaries, we set $C_{ock} = 2$
by default as suggested in [9]. We also run MCMP on the full vector with other values of $C_{ock}$ and
compare it with gk-means.

The term gk-means(a, b) is used to distinguish the different assignment approaches in Sec. 2.1 and the
different initialization schemes in Sec. 2.3: a = 1, 2 represents O1GA and O2GA, respectively; b
= r, k, h represents the random initialization, the K-means-based initialization, and the hierarchical
initialization, respectively.
Table 1: Relative distortion (×10^-2) on the databases with different code lengths.

                         |                   gk-means                    |
Dataset   Code length    | (2,h)   (2,k)   (2,r)   (1,h)   (1,k)   (1,r) |  ock-means   ck-means
SIFT1M         32          19.68   20.47   22.22   20.63   21.59   22.48      21.14      23.78
SIFT1M         64          11.01   13.19   19.46   11.56   14.10   16.16      13.37      14.45
SIFT1M        128           4.54    7.44   16.92    5.08    8.25   12.62       5.85       6.71
GIST1M         32          40.56   41.24   41.58   42.73   42.59   44.32      41.64      43.19
GIST1M         64          32.11   33.86   35.94   33.92   34.75   36.76      33.38      34.32
GIST1M        128          24.48   26.87   29.12   25.93   27.56   29.29      25.18      25.69
MNIST          32          14.25   16.52   17.62   15.29   18.18   22.43      15.53      16.76
MNIST          64           9.44   14.12   16.48   10.36   14.88   18.38      10.83      11.81
MNIST         128           5.91   11.63   14.28    6.55   12.37   16.11       7.28       7.57


3.2 Results 


The relative distortions on the training sets are illustrated in Fig. 1 with different numbers of
iterations. In each iteration, we report the relative distortion after Line 5 in Alg. 1. Since ock-means and
ck-means adopt a similar alternating optimization algorithm, we also collect their relative distortions
before the end of each iteration. Certain curves stop before 100 iterations because of convergence.

O1GA vs O2GA. In most cases, O2GA is better than O1GA, with some exceptions, e.g., the
comparison of gk-means(1, r) and gk-means(2, r) in Fig. 1(b). This may be because in the iterative
optimization, different assignment algorithms generate different assignments, and the dictionaries
are optimized in different directions. With different dictionaries, we cannot guarantee the
superiority of O2GA over O1GA. In the cases where O2GA is better, the improvement varies among
different settings. For example, in Fig. 1(a) and Fig. 1(d), the improvement of gk-means(2, h) over
gk-means(1, h) is much more significant, while in Fig. 1(b) and Fig. 1(c), the difference is minor.

Comparison of initializations. Generally, the hierarchical initialization is the best, the
K-means-based initialization is the second, and the random initialization is the worst. Although we cannot
guarantee this in theory, the observation holds under all the settings in Fig. 1.

Comparison with ock-means and ck-means. From Fig. 1, we can see that the random initialization
is almost always inferior, while the hierarchical initialization always leads to a lower distortion
than ock-means and ck-means. This also implies that the initialization is quite important for such a
non-convex problem.

Similar observations can be made w.r.t. the relative distortion on the databases reported in Table 1.

Finally, we compare gk-means with MCMP [9] on the full vector ($M_{ock} = 1$). We evaluate
the time cost on a Linux server with a 2660 MHz CPU and 48 GB of memory. The experiment is
conducted on SIFT1M, and the program runs in 24 threads to encode the $10^6$ database points.

The results are shown in Table 2. Due to the high time cost, we cannot run MCMP with 8
dictionaries. Thus, we estimate the time cost as follows. The dictionaries are trained by gk-means(2, h),
and the time cost is collected on 240 database points with the MCMP algorithm, since we deploy 24
threads. The result is 2589.66 seconds, and we multiply it by $10^6/240$ to estimate the time cost to
encode the whole $10^6$ database points. The results of gk-means are obtained with the best-performing
hierarchical initialization. From the results, we can see that in terms of time cost both gk-means(1, h) and
gk-means(2, h) scale well with the number of dictionaries while MCMP does not. This is because the
time cost of gk-means is linear in the number of dictionaries while that of MCMP is exponential.
Meanwhile, the time cost of gk-means(1, h) is less than that of gk-means(2, h), because the complexity of
the former is $O(K)$ while that of the latter is $O(K^2)$. In terms of the relative distortion, MCMP is
slightly better than gk-means(2, h), which in turn is better than gk-means(1, h).
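The extrapolation for MCMP with 8 dictionaries is a simple proportional scaling, reproduced below as a sanity check. The sketch assumes a database of $10^6$ points, which is consistent with the 10790250.0 seconds listed in Table 2; the variable names are illustrative.

```python
measured_seconds = 2589.66      # reported MCMP time for the 240-point sample
sample_points = 240
database_points = 10 ** 6       # assumed database size

# Scale the measured time by the ratio of database size to sample size.
estimated_seconds = measured_seconds * database_points / sample_points
estimated_days = estimated_seconds / 86400   # 86400 seconds per day
```

The estimate corresponds to roughly 125 days of encoding time, compared with a few hundred seconds for gk-means in the same column of the table.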
Table 2: Encoding time and the relative distortion (R.d.) on the database of SIFT1M.

Number of dictionaries          2                    4                      8
                         Time (s)   R.d.      Time (s)   R.d.       Time (s)     R.d.
gk-means(1, h)              8.6     0.30        20.3     0.21          120.4     0.12
gk-means(2, h)             45.2     0.29       110.3     0.20          392.9     0.11
MCMP                       12.2     0.29       723.3     0.19     10790250.0      -


4 Conclusion 

We proposed a simple yet effective algorithm, named group K-means, to effectively encode
data points. With the desirable low distortion errors, this approach can represent high-dimensional
data points with less storage cost and facilitate data management. Future work includes applying it
to applications such as multimedia retrieval with inner product measurement and image compression.


Appendices 


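This appendix reports the experimental results of approximate nearest neighbor search, in which our
group K-means (gk-means) is compared with other approaches.

We compute the approximate distance between the query $q \in \mathbb{R}^P$ and the database
point $x \in \mathbb{R}^P$ encoded as $(k_1, \cdots, k_C)$ by

$$\frac{1}{2}\Big\|q - \sum_{c=1}^{C} d^c_{k_c}\Big\|_2^2 = \frac{1}{2}\|q\|_2^2 - \sum_{c=1}^{C} q^\top d^c_{k_c} + \frac{1}{2}\Big\|\sum_{c=1}^{C} d^c_{k_c}\Big\|_2^2. \tag{28}$$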


The first term is constant for a given query and can be omitted during distance evaluation. The third
term is independent of the query and can be pre-computed from the code $(k_1, \cdots, k_C)$ and the
dictionaries; no original data is required for its computation. To compute the second term, we can
evaluate all the inner products $q^\top d^c_k$ for $c \in \{1, \cdots, C\}$ and $k \in \{1, \cdots, K\}$ in advance. Then, each distance
computation only involves $O(C + 1)$ additions, which is comparable to ck-means [8].

For each query, we compare it with every database point by Eqn. (28) and rank all the points by the
approximate distance. We take recall as the performance criterion, i.e. the proportion of
queries whose true nearest neighbors (by Euclidean distance) fall in the top-ranked
points.


As studied in the paper, initialization is important for the minimization of the objective function and 
we only report the results with the best hierarchical initialization scheme. 

The results are shown in Fig. 2 on the three datasets with code lengths 32, 64, and 128. The suffixes in the
legends denote the code length. We can see that gk-means(2, h) almost always outperforms the others.
The performance of gk-means(1, h) is better than that of ock-means on SIFT1M and MNIST, but worse
on GIST1M. The performance varies because the data distributions of the datasets are different
and the numerical optimization does not achieve the theoretically optimal solution. The advantage
of gk-means(2, h) over gk-means(1, h) arises because the local minimum issue of gk-means(1, h) is more
severe. Moreover, the performance comparison for ANN search is quite consistent with the relative
distortion on the database shown in Table 1 in the paper.
Figure 2: Recall on the three datasets with different code lengths.


References 

[1] Artem Babenko and Victor S. Lempitsky. Additive quantization for extreme vector compression.
In 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014,
Columbus, OH, USA, June 23-28, 2014, pages 931-938, 2014.

[2] Christopher F. Barnes. A new multiple path search technique for residual vector quantizers. In
Proceedings of the IEEE Data Compression Conference, DCC 1994, Snowbird, Utah, March
29-31, 1994, pages 42-51, 1994.

[3] Chao Du and Jingdong Wang. Inner product similarity search using compositional codes.
CoRR, abs/1406.4966, 2014.

[4] Tiezheng Ge, Kaiming He, Qifa Ke, and Jian Sun. Optimized product quantization for
approximate nearest neighbor search. In CVPR, pages 2946-2953, 2013.

[5] Hervé Jégou, Matthijs Douze, and Cordelia Schmid. Product quantization for nearest neighbor
search. IEEE Trans. Pattern Anal. Mach. Intell., 33(1):117-128, 2011.

[6] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning
applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.

[7] Stuart P. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory,
28(2):129-136, 1982.

[8] Mohammad Norouzi and David J. Fleet. Cartesian k-means. In CVPR, pages 3017-3024,
2013.

[9] Jianfeng Wang, Jingdong Wang, Jingkuan Song, Xin-Shun Xu, Heng Tao Shen, and Shipeng
Li. Optimized Cartesian k-means. ArXiv e-prints, abs/1405.4054, May 2014.

[10] Ting Zhang, Chao Du, and Jingdong Wang. Composite quantization for approximate nearest
neighbor search. In ICML, 2014.






















